
[OTEL] Add OpenTelemetry observability support #285

Open

royischoss wants to merge 35 commits into mlrun:development from royischoss:ceml-641

Conversation

@royischoss (Contributor) commented Apr 5, 2026

Adds OTel-based observability to MLRun CE with automatic Python instrumentation, deployment-mode metrics collection, and Prometheus integration.
https://iguazio.atlassian.net/browse/CEML-685

What's implemented

OTel operator sub-chart

  • opentelemetry-operator v0.78.1 added as an optional dependency (wiring sketched after this list)
  • crds.create: true — the sub-chart manages OTel CRDs directly; full CRD YAML (~1.6 MB) compresses to ~160 KB gzipped in the Helm release Secret, well within the 3 MB Kubernetes API limit
  • Webhook namespace selector scoped to namespaces labeled opentelemetry.io/inject=enabled — avoids injecting into unrelated namespaces
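A minimal sketch of the dependency wiring; the repository URL and condition key are conventional assumptions, while the chart name, version, and crds.create value come from this PR:

```yaml
# requirements.yaml (sketch)
dependencies:
  - name: opentelemetry-operator
    version: 0.78.1
    repository: https://open-telemetry.github.io/opentelemetry-helm-charts  # assumed upstream repo
    condition: opentelemetry-operator.enabled  # conventional optional-dependency flag

# values.yaml (sketch)
opentelemetry-operator:
  enabled: false   # optional; flipped on where OTel is wanted
  crds:
    create: true   # sub-chart manages the OTel CRDs directly
```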

templates/opentelemetry/

namespace-label.yaml — post-install,post-upgrade hook (weight -10) that labels and annotates the release namespace (resulting metadata sketched after this list):

  • Label: opentelemetry.io/inject=enabled — activates the operator webhook for this namespace
  • Annotation: instrumentation.opentelemetry.io/inject-python points to the Instrumentation CR — namespace-wide Python auto-instrumentation for all pods
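The resulting namespace metadata would look roughly like this; the namespace and Instrumentation CR names are illustrative:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: mlrun   # release namespace; name illustrative
  labels:
    opentelemetry.io/inject: "enabled"   # matches the operator webhook's namespace selector
  annotations:
    # Namespace-wide Python auto-instrumentation; the CR name is hypothetical.
    instrumentation.opentelemetry.io/inject-python: "mlrun-instrumentation"
```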

otel-cr-installer.yaml — post-install,post-upgrade hook (weight 10; core logic sketched after this list) that:

  1. Polls kubectl wait --for=condition=Established on the three OTel CRDs before applying CRs (sub-chart renders CRDs as regular templates, not in crds/, so they may not exist when the hook fires)
  2. Applies the OpenTelemetryCollector CR with a bounded retry loop (max_retries=30, 30x5s=150s) until the operator webhook is ready
  3. Applies the Instrumentation CR the same way — both CRs are written to temp files before the retry loops (avoids a bash heredoc-in-until pitfall where the heredoc is consumed on the first condition evaluation)
  4. Skips kubectl rollout restart on upgrade if pods already have the OTel init container injected — avoids unnecessary pod restarts on every helm upgrade
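A condensed sketch of the hook's core logic as described above; file paths and the exit message are illustrative, and the third CRD is assumed to be opampbridges:

```bash
#!/usr/bin/env bash
set -euo pipefail

# 1. Block until the three OTel CRDs are Established (the sub-chart renders
#    them as regular templates, so they may lag behind the hook).
for crd in opentelemetrycollectors.opentelemetry.io \
           instrumentations.opentelemetry.io \
           opampbridges.opentelemetry.io; do
  kubectl wait --for=condition=Established "crd/${crd}" --timeout=120s
done

# 2. Write the CR to a temp file first (never a heredoc inside the until
#    condition), then apply with a bounded retry while the webhook comes up.
cat > /tmp/collector-cr.yaml <<'EOF'
# ... OpenTelemetryCollector CR body ...
EOF

max_retries=30
retries=0
until kubectl apply -f /tmp/collector-cr.yaml; do
  retries=$((retries + 1))
  if [ "${retries}" -ge "${max_retries}" ]; then
    echo "CR apply failed after ${max_retries} attempts; check the operator logs." >&2
    exit 1
  fi
  sleep 5
done
```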

rbac.yaml — regular Helm-managed resources (no hook annotations): ServiceAccount, ClusterRole, ClusterRoleBinding, Role, RoleBinding for the installer job. Being regular resources means they are cleaned up on helm uninstall.

Metrics pipeline: push model (OTLP to Prometheus)

  • OpenTelemetryCollector CR: deployment mode, single collector pod per namespace
  • Receives OTLP from all auto-instrumented workloads on gRPC :4317 and HTTP :4318
  • Exports metrics by pushing to Prometheus via otlphttp/prometheus at http://prometheus-operated:9090/api/v1/otlp (pipeline sketched after this list)
  • Prometheus configured with --enable-feature=otlp-write-receiver and --web.enable-otlp-receiver
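Under opentelemetry.collector.config (introduced later in this PR), the pipeline could look roughly like this; the batch processor is illustrative:

```yaml
opentelemetry:
  collector:
    config:
      receivers:
        otlp:
          protocols:
            grpc:
              endpoint: 0.0.0.0:4317
            http:
              endpoint: 0.0.0.0:4318
      processors:
        batch: {}   # illustrative
      exporters:
        otlphttp/prometheus:
          endpoint: http://prometheus-operated:9090/api/v1/otlp
      service:
        pipelines:
          metrics:
            receivers: [otlp]
            processors: [batch]
            exporters: [otlphttp/prometheus]
```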

Instrumentation CR (Python auto-instrumentation)

  • OTEL_SERVICE_NAME set via fieldRef: metadata.name — each pod reports its own pod name as the service name
  • aws_lambda instrumentor disabled to suppress irrelevant Lambda warnings
  • OTEL_LOGS_EXPORTER=none — traces and metrics only
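The CR would look roughly like this. The name and exporter endpoint are placeholders, and the aws_lambda disable is expressed here via OTEL_PYTHON_DISABLED_INSTRUMENTATIONS, which is one way to disable a Python instrumentor; the PR may use a different mechanism:

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: mlrun-instrumentation              # illustrative
spec:
  exporter:
    endpoint: http://otel-collector:4318   # placeholder; the namespace collector service
  python:
    env:
      - name: OTEL_SERVICE_NAME
        valueFrom:
          fieldRef:
            fieldPath: metadata.name       # each pod reports its own pod name
      - name: OTEL_LOGS_EXPORTER
        value: none                        # traces and metrics only
      - name: OTEL_PYTHON_DISABLED_INSTRUMENTATIONS
        value: aws_lambda                  # assumption: env-var route to disable the instrumentor
```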

Nuclio function pods

  • mlrun.io/otel: "true" label and OTEL_SERVICE_NAME env added to all Nuclio function pods via functionDefaults. These are intentionally always set; the label only triggers injection if the operator webhook is active in the namespace.
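In effect, each Nuclio function pod carries something like this (a sketch of the resulting pod spec, not the functionDefaults syntax; the container name is illustrative):

```yaml
metadata:
  labels:
    mlrun.io/otel: "true"     # inert unless the webhook is active in the namespace
spec:
  containers:
    - name: nuclio-function   # illustrative
      env:
        - name: OTEL_SERVICE_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
```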

MLRun API — PYTHONPATH workaround

  • mlrun.api.extraEnvKeyValue.PYTHONPATH explicitly exposes the Docker image's PYTHONPATH as a Kubernetes env var so the OTel operator injection (PYTHONPATH=/otel-auto-instrumentation-python:$(PYTHONPATH)) can expand $(PYTHONPATH) correctly — Kubernetes env var substitution cannot see ENV vars baked into the Docker image (sketch below)
  • The cleaner solution (opt-out annotation on the API pod) is not possible: the upstream mlrun chart hardcodes pod annotations with no podAnnotations values key
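A sketch of the workaround; the PYTHONPATH value is a placeholder for whatever the image actually sets:

```yaml
# values.yaml (sketch)
mlrun:
  api:
    extraEnvKeyValue:
      PYTHONPATH: /mlrun   # placeholder for the image's real PYTHONPATH

# After webhook injection the operator rewrites the env var to:
#   PYTHONPATH=/otel-auto-instrumentation-python:$(PYTHONPATH)
# Kubernetes $(VAR) substitution only sees env vars declared in the pod spec,
# so without the explicit entry above the expansion fails and the
# image-level PYTHONPATH would be lost.
```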

tests/package.sh

  • Reads opentelemetry-operator version dynamically from requirements.yaml (no hardcoded version string)
  • Patches the upstream chart values.schema.json (featureGates.examples string to array) for Helm v4 metaschema compliance
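The dynamic version read could look roughly like this (a sketch; assumes PyYAML is available and requirements.yaml uses the standard Helm dependencies list):

```bash
# tests/package.sh (sketch)
OTEL_OPERATOR_VERSION=$(python3 - <<'PY'
import yaml
deps = yaml.safe_load(open("requirements.yaml"))["dependencies"]
print(next(d["version"] for d in deps if d["name"] == "opentelemetry-operator"))
PY
)
echo "patching opentelemetry-operator-${OTEL_OPERATOR_VERSION}.tgz"
```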

Admin / non-admin split

  • Admin values: OTel operator enabled with namespace-selector webhook; CRs disabled
  • Non-admin values: operator disabled; collector + instrumentation CRs enabled
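The split might be expressed roughly like this; the operator enable key is a conventional guess, while opentelemetry.collector.enabled appears elsewhere in this PR:

```yaml
# admin values (sketch)
opentelemetry-operator:
  enabled: true        # operator + namespace-selector webhook
opentelemetry:
  collector:
    enabled: false     # CRs left to the non-admin install

# non-admin values (sketch)
opentelemetry-operator:
  enabled: false
opentelemetry:
  collector:
    enabled: true      # collector + instrumentation CRs
```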

Generated with Claude Code


Follow-up changes:

  - Move the hardcoded OTel collector pipeline config into values.yaml under opentelemetry.collector.config — users can now override receivers, processors, and exporters without forking the chart. The Prometheus endpoint uses short DNS (prometheus-operated:9090), removing namespace interpolation from the helper.
  - Add opentelemetry.kubectlImage to values.yaml (defaults to bitnami/kubectl:latest) and reference it in both crd-readiness-job.yaml and namespace-label.yaml instead of a hardcoded tag.
  - Fix namespace-label.yaml: replace indent with nindent for correct YAML formatting; change restartPolicy: Never to OnFailure so the job retries on transient failures.
  - Delete the empty collector.yaml and instrumentation.yaml template files that generated no resources and were misleading. Move their documentation comment into crd-readiness-job.yaml, where the actual CR creation happens.
  - Replace the 50-line hardcoded collector manifest in _helpers.tpl with toYaml .Values.opentelemetry.collector.config | nindent 4 (sketched below).
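The helper then reduces to a pass-through of the values block, roughly (the define name is illustrative):

```yaml
{{- define "mlrun-ce.otelCollectorConfig" -}}
config:
{{- toYaml .Values.opentelemetry.collector.config | nindent 4 }}
{{- end -}}
```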
royischoss and others added 6 commits April 20, 2026 13:52

  Issue 2 — Bounded retry in CR installer (High)

  The until kubectl apply loop in otel-cr-installer.yaml had no exit condition. On permanent failure (crashed operator, image pull error) it would spin silently until the 300s hook timeout killed it with a cryptic deadline error.
  Added a max_retries=30 counter (30 × 5s = 150s max) with a clear exit 1 and a pointer to the operator logs on exhaustion. Applied to both the Collector and Instrumentation CR apply loops.

  Issue 3 — Skip rollout restart on upgrade (Medium)

  The otel-cr-installer hook runs on both post-install and post-upgrade. Previously it always restarted all pods labeled mlrun.io/otel=true, causing unnecessary Jupyter and Nuclio churn on every helm upgrade even when OTel was already running. Added an init container presence check (sketched below): if pods already have the opentelemetry init container injected, the restart is skipped. Restarts now happen only on fresh installs, where pods started before the webhook was ready.
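A sketch of the guard, assuming the injected init container name starts with opentelemetry-auto-instrumentation (the operator's convention for Python injection) and a deployment-level restart:

```bash
# Skip the restart when injection has already happened.
if kubectl get pods -l mlrun.io/otel=true \
     -o jsonpath='{range .items[*]}{.spec.initContainers[*].name}{"\n"}{end}' \
   | grep -q opentelemetry-auto-instrumentation; then
  echo "OTel init container already injected; skipping rollout restart"
else
  kubectl rollout restart deployment -l mlrun.io/otel=true
fi
```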

  Issue 4 — Document PYTHONPATH workaround (Medium)

  The mlrun.api.extraEnvKeyValue.PYTHONPATH entry exists to prevent OTel's $(PYTHONPATH) expansion from resolving to an empty string. The ideal fix — adding instrumentation.opentelemetry.io/inject-python: "false" as a pod annotation to opt the MLRun API out of injection — is not possible: the upstream mlrun chart template hardcodes pod annotations with no podAnnotations values key. Improved the comment to document this constraint clearly so the next reader doesn't re-investigate.

  Issue 5 — Document Prometheus OTLP receiver (Medium)

  enableFeatures: [otlp-write-receiver] and --web.enable-otlp-receiver are always set on the Prometheus sub-chart regardless of opentelemetry.collector.enabled — there is no Helm-native way to conditionally configure sub-chart values (values sketch below).
  Added a comment explaining the always-on behavior and why the added attack surface is negligible: the OTLP endpoint shares the already-unauthenticated port 9090 with the Prometheus query API.
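On a kube-prometheus-stack-style values file this is a static entry, roughly as follows; the exact key paths depend on the Prometheus sub-chart in use:

```yaml
prometheus:
  prometheusSpec:
    enableFeatures:
      - otlp-write-receiver      # always on; a parent chart cannot make
                                 # sub-chart values conditional
    additionalArgs:
      - name: web.enable-otlp-receiver
        value: ""
```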

  Issue 6 — RBAC resources leak on uninstall (Medium)

  All five RBAC resources (ClusterRole + ClusterRoleBinding for otel-crd-reader; ServiceAccount + Role + RoleBinding for otel-cr-creator) were annotated as Helm hooks with only a before-hook-creation delete policy. helm uninstall doesn't run hooks, so these cluster-scoped resources were never deleted — confirmed on the test cluster, where an old my-mlrun-otel-crd-reader from 11 days prior was still present. Removed all hook annotations; the resources are now regular Helm-managed objects and are deleted on uninstall (before/after sketched below). namespace-label.yaml moved from pre-install,pre-upgrade to post-install,post-upgrade (weight -10) so the regular RBAC resources exist before the labeling job runs.

  Issue 7 — package.sh version hardcoded + CRD slimming broken (Medium)

  Two bugs in tests/package.sh:
  1. The OTel operator .tgz filename was hardcoded as opentelemetry-operator-0.78.1.tgz in two places — version bumps would silently skip the schema patch. Changed to read the version dynamically from requirements.yaml via Python.
  2. The CRD slimming step replaced the conf/crds/ templates with empty stubs (assuming crds.create: false). This broke when we switched to Option B (crds.create: true): the stubs rendered nothing, so CRDs were never created on a fresh packaged-chart install. Removed the slimming step entirely. The full CRD YAML (~1.6 MB) compresses to ~160 KB gzipped in the Helm release Secret, well within the 3 MB Kubernetes limit.

  Bash heredoc-in-until fix (discovered during cluster validation)

  The Instrumentation CR apply originally used until cat <<'EOF' | kubectl apply -f -; do — a heredoc inside an until condition. This is unreliable: the heredoc content is consumed on the first evaluation, and subsequent retries pipe empty input to kubectl. Changed to write to /tmp/instrumentation-cr.yaml first (the same pattern as the Collector CR), which is unconditionally safe across bash versions. The before/after pattern is sketched below.
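Condensed before/after, with the CR bodies elided:

```bash
# Before — heredoc inside the until condition; per this PR's finding the
# heredoc is consumed on the first evaluation, so retries pipe empty input.
until cat <<'EOF' | kubectl apply -f -
# ... Instrumentation CR ...
EOF
do sleep 5; done

# After — write once, re-read on every retry.
cat > /tmp/instrumentation-cr.yaml <<'EOF'
# ... Instrumentation CR ...
EOF
until kubectl apply -f /tmp/instrumentation-cr.yaml; do
  sleep 5
done
```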

  Tests

  Added 7 new test cases to tests/helm-template-test.sh that catch all of the above regressions locally before cluster install:
  - RBAC: no helm.sh/hook annotations on any resource (catches hook leak)
  - RBAC: no before-hook-creation delete policy (same)
  - Namespace-label: uses post-install,post-upgrade not pre-install (catches SA missing at hook time)
  - CR installer: has max_retries, retries=0, and exit 1 (catches infinite loop)
  - CR installer: uses collector-cr.yaml and instrumentation-cr.yaml temp files (catches heredoc-in-until)
  - CR installer: has init container check and skip message (catches unnecessary upgrade restarts)
@davesh0812 (Contributor) left a comment:

LGTM
